Executive Summary

The introduction of computerized adaptive testing (CAT) has made it necessary to build large pools 
of test items with the item statistics (commonly called parameters) needed to describe the characteristics of 
the items. The process of obtaining item parameters usually consists of the following two stages:

1. Pretest stage. In a series of sessions, sets of items are administered to groups of test takers, and a 
mathematical model called item response theory (IRT) is used to obtain estimates of item parameters 
representing such features as item difficulty, discriminating power (the ability of the item to 
distinguish between more and less able test takers), or susceptibility to guessing. 

2. Online stage. The test is operational and administered online but the responses are also used for 
parameter estimation, for example, to keep improving the precision of previous estimates or to 
obtain estimates for new items added to the pool. 

In this paper it is proposed that methods of quality control be used in the calibration process, for
example, to check whether the values of the item parameters have drifted between the pretest and the online
stage. If parameter drift is found, the response data cannot be pooled to increase the precision of the
parameter estimates. Methods of quality control can also be used to detect security breaches in an online 
stage. Three different statistics for quality control are proposed: (1) a Lagrange multiplier (LM) statistic; (2) a 
Wald statistic; and (3) a cumulative sum (CUSUM) statistic. The power of the tests based on these statistics, 
that is, their ability to detect shifts in the parameter values, was evaluated. 

It was found that the tests had moderate to good power to detect shifts in the values of the guessing and 
difficulty parameters. In addition, all tests were equally sensitive to shifts in the values of all parameters, 
even if the null hypothesis of no shift was formulated for only one of them. This result is not surprising 
because estimates of the parameters in the model evaluated are usually highly correlated. The practical 
conclusion from the study is that all of these statistics can be used very well to detect if something has
happened to the item parameters but that it may be difficult to attribute the problems to specific parameters. 

Abstract 

In computerized adaptive testing, updating item parameter estimates using adaptive testing data is 
often called online calibration. This paper investigates how to evaluate whether the adaptive testing data 
used for online calibration sufficiently fit the item response model used. Three approaches are investigated, 
based on a Lagrange multiplier (LM) statistic, a Wald statistic, and a cumulative sum (CUSUM) statistic. The 
power of the tests is evaluated with a number of simulation studies. 

Introduction 

Computerized assessment, such as CBT (computer based testing) and CAT (computer adaptive testing), 
is based on the availability of a large pool of calibrated test items. Usually, the calibration process consists of 
two stages. 

1. The pretesting stage. In this stage, subsets of items are administered to subsets of respondents in a 
series of pretest sessions, and an item response (IRT) model is fitted to the data to obtain item 
parameter estimates to support computerized test administration. 

2. The online stage. In this stage, data are gathered in a computerized assessment environment. There 
may be several motives for using these data for further parameter estimation. The interest may
be to continuously update estimates to attain the greatest possible precision. Or new, previously 
uncalibrated items may be entered into the bank and can only be calibrated using incoming responses. 

Closely related to the motives for online calibration, but also an aim in itself, is quality control, that is, 
checking whether pretest and online results comply with the same IRT model. In the present paper, three
methods of quality control are proposed. The first method is based on the Lagrange multiplier (LM) statistic.
The method can be viewed as a generalization to adaptive testing of the modification indices for the 2-PL
model and the nominal response model introduced by Glas (1997, 1998). The second method is based on a
Wald statistic. The third method is based on a so-called cumulative sum (CUSUM) statistic. This last
approach stems from the field of statistical quality control (see, for instance, Wetherill, 197[?]). Using this
method in the framework of IRT-based adaptive testing was first suggested by Veerkamp (1996) in the
framework of the Rasch model. In this paper, the procedure will be generalized to the 3-PL model.

This paper is organized as follows. The section that follows outlines a framework for estimation of the
2-PL model. This model will subsequently be used for a general introduction of the LM statistic in the next
section. The LM, Wald, and CUSUM statistics will then be applied to quality control in adaptive
testing. Section 6 evaluates the performance of the proposed methods with a number of simulation studies.
Finally, in Section 7 some conclusions and suggestions for further research will be formulated.

Before proceeding, a remark with respect to the scope of this paper. Strictly speaking, the methods
proposed here also apply to a situation where there is no pretest stage and the item bank is bootstrapped
during the online stage. However, without a pretest stage, in the initial stages of online calibration, the data
on some of the items may be prohibitively scarce or even ill-conditioned, in the sense that there is too little
information in the data to estimate all relevant parameters. Below, it will be assumed that the data are such
that parameter estimates can be obtained. Generalization of the methods to be proposed to ill-conditioned
data, probably by introducing prior distributions on the item parameters, is beyond the scope of the present
paper and will be treated later. Further, it will be assumed that the number of items in the bank is such that
standard errors of estimates can be computed using the complete information matrix. Application of the
procedures to very large item banks, where other approximations to the standard errors have to be made,
is also a point of future research.



Preliminaries 



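Consider dichotomous items where responses of persons labeled $n$ to items labeled $i$ are coded $X_{ni} = 0$
or $X_{ni} = 1$. The probability of a correct response is given by

$$\phi_i(\theta_n) = \Pr(X_{ni} = 1 \mid \theta_n, \alpha_i, \beta_i, \gamma_i) = \gamma_i + (1 - \gamma_i)\,\psi_i(\theta_n) = \gamma_i + (1 - \gamma_i)\,\frac{\exp(\alpha_i \theta_n - \beta_i)}{1 + \exp(\alpha_i \theta_n - \beta_i)}, \tag{1}$$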



where $\theta_n$ is the ability parameter of person $n$ and $\alpha_i$, $\beta_i$, and $\gamma_i$ are the discrimination, difficulty, and guessing
parameter of item $i$, respectively. Since simultaneous ML estimates of all item parameters are hard to obtain
(see, for instance, Swaminathan & Gifford, 1986), in the present paper it will be assumed that $\gamma_i$ is fixed to
some plausible constant, say, to the guessing probability. Using priors on $\gamma_i$ to facilitate its estimation is a
topic for future study. Below, the well-known theory of MML estimation for IRT models will be reiterated. In
this presentation the formalism of Glas (1992, 1997, 1998) will be used, which, as will become apparent in the
sequel, is especially suited for the introduction of the procedures below. The choice of a distribution of
ability is not essential to the theory presented here; it can be the parametric MML framework (see Bock &
Aitkin, 1981) or the nonparametric MML framework (see DeLeeuw & Verhelst, 1986; Follmann, 1988).
However, to make the presentation explicit, it is assumed that the ability distribution is normal with
parameters $\mu$ and $\sigma$. Further, for reasons of simplicity, it is assumed that all respondents belong to the same
population. Modern software for the 2- and 3-PL model, such as Bilog-MG (Zimowski, Muraki, Mislevy, &
Bock, 1996), does not have this restriction, but this generalization is straightforward. So, let $g(\theta_n; \mu, \sigma)$ be the
density of $\theta_n$. Further, let the item administration variable $d_{ni}$ take the value one if item $i$ was administered
to person $n$, and zero if this was not the case. If $d_{ni} = 0$, it will be assumed that $x_{ni} = c$, where $c$ is some arbitrary
constant.
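As a minimal sketch of Equation 1, the response probability can be written as a small function. The function name and the example item values are hypothetical and serve only to illustrate how the guessing parameter lifts the lower asymptote of the logistic curve.

```python
import math

def three_pl_probability(theta, alpha, beta, gamma):
    """Probability of a correct response under the 3-PL model of Equation 1."""
    # Logistic part psi_i(theta) = exp(a*theta - b) / (1 + exp(a*theta - b)).
    psi = 1.0 / (1.0 + math.exp(-(alpha * theta - beta)))
    # The guessing parameter raises the lower asymptote from 0 to gamma.
    return gamma + (1.0 - gamma) * psi

# Hypothetical item with alpha = 1.0, beta = 0.0, gamma = 0.2.
p_mid = three_pl_probability(0.0, 1.0, 0.0, 0.2)    # 0.2 + 0.8 * 0.5
p_low = three_pl_probability(-10.0, 1.0, 0.0, 0.2)  # approaches gamma
```

For a very low ability the probability stays near the guessing level rather than dropping to zero, which is the feature the quality control statistics below are meant to monitor.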

Let $x_n$ and $d_n$ be the response pattern and the item administration vector of respondent $n$, respectively.
With a reference to the ignorability principle by Rubin (1976), Mislevy (1986) asserts that in adaptive testing
consistent ML estimates of the model parameters can be obtained by maximizing the likelihood of responses $x_n$
conditionally on the design $d_n$, that is, the design can be ignored. So, if $\xi = (\alpha', \beta', \mu, \sigma)$ is the vector of all
item and population parameters, the log-likelihood to be maximized can be written as

$$\ln L(\xi; X, D) = \sum_n \ln \Pr(x_n \mid d_n; \xi), \tag{2}$$

where $X$ stands for the data matrix and $D$ stands for the design matrix.
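The log-likelihood of Equation 2 can be sketched numerically by integrating each response pattern over the ability distribution. The sketch below is an illustration only: it assumes a normal ability distribution and uses a crude equally spaced grid instead of a proper Gauss-Hermite rule, and all function names are hypothetical.

```python
import math

def three_pl(theta, alpha, beta, gamma):
    psi = 1.0 / (1.0 + math.exp(-(alpha * theta - beta)))
    return gamma + (1.0 - gamma) * psi

def normal_grid(mu, sigma, n_points=61, width=5.0):
    """Equally spaced nodes with normalized normal weights (a crude quadrature)."""
    nodes = [mu + sigma * (-width + 2.0 * width * k / (n_points - 1))
             for k in range(n_points)]
    dens = [math.exp(-0.5 * ((t - mu) / sigma) ** 2) for t in nodes]
    total = sum(dens)
    return nodes, [d / total for d in dens]

def marginal_log_likelihood(X, D, alpha, beta, gamma, mu=0.0, sigma=1.0):
    """Sum over persons of ln Pr(x_n | d_n), with theta integrated out."""
    nodes, weights = normal_grid(mu, sigma)
    log_lik = 0.0
    for x_n, d_n in zip(X, D):
        marginal = 0.0
        for theta, w in zip(nodes, weights):
            p_pattern = 1.0
            for i, (x, d) in enumerate(zip(x_n, d_n)):
                if d == 1:  # items with d_ni = 0 were not administered and are skipped
                    p = three_pl(theta, alpha[i], beta[i], gamma[i])
                    p_pattern *= p if x == 1 else 1.0 - p
            marginal += w * p_pattern
        log_lik += math.log(marginal)
    return log_lik

# One person, two items, only the first administered and answered correctly.
ll = marginal_log_likelihood([[1, 0]], [[1, 0]], [1.0, 1.0], [0.0, 0.5], [0.0, 0.2])
```

Because the design vector only switches items on or off inside the product, the arbitrary code used for responses to items that were not administered has no effect on the result.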

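To derive the MML estimation equations, it proves convenient to introduce the vector of derivatives

$$b_n(\xi) = \frac{\partial}{\partial \xi} \ln \Pr(x_n, \theta_n \mid d_n; \xi) = \frac{\partial}{\partial \xi}\left[\ln \Pr(x_n \mid d_n, \theta_n, \alpha, \beta, \gamma) + \ln g(\theta_n \mid \mu, \sigma)\right], \tag{3}$$

with

$$\ln \Pr(x_n \mid d_n, \theta_n, \alpha, \beta, \gamma) = \sum_i d_{ni}\left[x_{ni} \ln \phi_i(\theta_n) + (1 - x_{ni}) \ln\left(1 - \phi_i(\theta_n)\right)\right]. \tag{4}$$

Glas (1992, 1997, 1998) adopts an identity due to Louis (1982) to write the first order derivatives of
Equation 2 with respect to $\xi$ as

$$\frac{\partial}{\partial \xi} \ln L(\xi; X, D) = \sum_n E\left(b_n(\xi) \mid x_n, d_n; \xi\right). \tag{5}$$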



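This identity greatly simplifies the derivation of the likelihood equations. For instance, using the shorthand
notation $\psi_{ni} = \psi_i(\theta_n)$ and $\phi_{ni} = \phi_i(\theta_n)$, from Equations 3 and 4 it can be easily verified that

$$b_n(\alpha_i) = d_{ni}\,\frac{(x_{ni} - \phi_{ni})(1 - \gamma_i)\,\theta_n\,\psi_{ni}(1 - \psi_{ni})}{\phi_{ni}(1 - \phi_{ni})} \tag{6}$$

and

$$b_n(\beta_i) = d_{ni}\,\frac{(\phi_{ni} - x_{ni})(1 - \gamma_i)\,\psi_{ni}(1 - \psi_{ni})}{\phi_{ni}(1 - \phi_{ni})}. \tag{7}$$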
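The derivatives in Equations 6 and 7 can be checked numerically. The sketch below assumes that both expressions follow from differentiating the logarithm of the response probability in Equation 1 for an administered item; the function names and the test values are hypothetical. A central finite difference of the log-probability is compared with the closed-form score.

```python
import math

def psi(theta, alpha, beta):
    return 1.0 / (1.0 + math.exp(-(alpha * theta - beta)))

def phi(theta, alpha, beta, gamma):
    return gamma + (1.0 - gamma) * psi(theta, alpha, beta)

def log_prob(x, theta, alpha, beta, gamma):
    p = phi(theta, alpha, beta, gamma)
    return math.log(p) if x == 1 else math.log(1.0 - p)

def score_alpha(x, theta, alpha, beta, gamma):
    """Derivative with respect to alpha_i for an administered item (d_ni = 1)."""
    s, p = psi(theta, alpha, beta), phi(theta, alpha, beta, gamma)
    return (x - p) * (1.0 - gamma) * theta * s * (1.0 - s) / (p * (1.0 - p))

def score_beta(x, theta, alpha, beta, gamma):
    """Derivative with respect to beta_i for an administered item (d_ni = 1)."""
    s, p = psi(theta, alpha, beta), phi(theta, alpha, beta, gamma)
    return (p - x) * (1.0 - gamma) * s * (1.0 - s) / (p * (1.0 - p))

def numeric_derivative(f, value, eps=1e-6):
    """Central finite difference of f at value."""
    return (f(value + eps) - f(value - eps)) / (2.0 * eps)

# Hypothetical values for one response.
x, theta, a, b, g = 1, 0.7, 1.2, -0.3, 0.2
num_a = numeric_derivative(lambda v: log_prob(x, theta, v, b, g), a)
num_b = numeric_derivative(lambda v: log_prob(x, theta, a, v, g), b)
```

The two scores have opposite signs for a positive ability because a larger discrimination raises, and a larger difficulty lowers, the probability of a correct response.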



The likelihood equations for the item parameters are found upon inserting these expressions into
Equation 5 and equating these expressions to zero. To derive the likelihood equations for the population
parameters, using Equation 3 results in

$$b_n(\mu) = (\theta_n - \mu)\,\sigma^{-2} \tag{8}$$

and

$$b_n(\sigma) = -\sigma^{-1} + (\theta_n - \mu)^2 \sigma^{-3}. \tag{9}$$

The likelihood equations are again found by inserting these expressions in Equation 5 and equating these
expressions to zero.

For computing estimation errors, and the LM, Wald, and CUSUM statistics, the second order
derivatives of the log-likelihood function are also needed. As with the derivation of the estimation equations,
the theory by Louis (1982) can be used for the derivation of the matrix of second order derivatives. Using
Glas (1992), it follows that the observed information matrix, which is the opposite of the matrix of second
order derivatives, that is,

$$H(\xi, \xi) = -\frac{\partial^2 \ln L(\xi; X, D)}{\partial \xi\, \partial \xi'}, \tag{10}$$

evaluated using MML estimates, is given by

[Equation 11 could not be recovered from the source text.] (11)

where

$$B_n(\xi, \xi) = \frac{\partial^2 \ln \Pr(x_n, \theta_n \mid d_n; \xi)}{\partial \xi\, \partial \xi'}. \tag{12}$$

Unfortunately, for the 3-PL model, the exact expressions for the second order derivatives become
prohibitively complicated. However, Mislevy (1986) points out that the observed information matrix can be
approximated as

$$H(\xi, \xi) \approx \sum_n E\left(b_n(\xi) \mid x_n, d_n; \xi\right) E\left(b_n(\xi) \mid x_n, d_n; \xi\right)'. \tag{13}$$

Simulation studies by Glas (1997) in the framework of the 2-PL model and the nominal response model
(Bock, 1972) show that this approximation is quite good, in the sense that statistics based on this
approximation attain their theoretical distribution. In the sequel, it will become apparent that this must also
hold for the 3-PL model.
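The approximation attributed to Mislevy (1986) can be illustrated with a short sketch. It assumes that the approximation takes the usual cross-product form, that is, the sum over persons of the outer product of each person's posterior expected vector of first order derivatives. The function name and the numbers are hypothetical.

```python
def cross_product_information(expected_scores):
    """Sum over persons of E(b_n | x_n) E(b_n | x_n)' as a nested list."""
    size = len(expected_scores[0])
    info = [[0.0] * size for _ in range(size)]
    for score in expected_scores:
        for r in range(size):
            for c in range(size):
                info[r][c] += score[r] * score[c]
    return info

# Hypothetical posterior expected derivatives for three persons, two parameters.
# At the MML estimates these vectors sum to zero over persons.
scores = [[0.5, -0.2], [-0.3, 0.4], [-0.2, -0.2]]
info = cross_product_information(scores)
```

The appeal of this form is that it only needs the first order derivatives that are already computed for the likelihood equations.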



Lagrange Multiplier Tests 

Earlier applications of LM tests to the framework of IRT have been described by Glas and Verhelst (1995)
and Glas (1997, 1998). The principle of the LM test (Aitchison & Silvey, 1958), and the equivalent
efficient-score test (Rao, 1948) can be summarized as follows. Consider a null-hypothesis about a model with
parameters $\phi_0$. This model is a special case of a general model with parameters $\phi$. In the present case the
special model is derived from the general model by fixing one or more parameters to known constants. Let
$\phi_0$ be partitioned as $\phi_0' = (\phi_{01}', \phi_{02}') = (\phi_{01}', c')$, where $c$ is the vector of the postulated constants and $\phi_{01}$ is the
vector of free parameters of the special model. Let $h(\phi)$ be the partial derivatives of the log-likelihood of the
general model, so $h(\phi) = (\partial/\partial\phi) \ln L(\phi)$. This vector of partial derivatives gauges the change of the
log-likelihood as a function of local changes in $\phi$. Let $H(\phi, \phi)$ be defined as $-(\partial^2/\partial\phi\,\partial\phi') \ln L(\phi)$. Then the LM
statistic is given by

$$LM = h(\phi_0)'\, H(\phi_0, \phi_0)^{-1}\, h(\phi_0). \tag{14}$$

If Equation 14 is evaluated using the ML estimate of $\phi_{01}$ and the postulated values of $c$, it has an asymptotic
$\chi^2$ distribution with degrees of freedom equal to the number of parameters fixed (Aitchison & Silvey, 1958).

An important computational aspect of the procedure is that at the point of the ML estimates the free
parameters have a partial derivative equal to zero. Therefore, Equation 14 can be computed as

$$LM(c) = h(c)'\, W^{-1}\, h(c) \tag{15}$$

with

$$W = H_{22}(c, c) - H_{21}(c, \hat\phi_{01})\, H_{11}(\hat\phi_{01}, \hat\phi_{01})^{-1}\, H_{12}(\hat\phi_{01}, c), \tag{16}$$

where the partitioning of $H(\phi_0, \phi_0)$ into $H_{22}(c, c)$, $H_{21}(c, \hat\phi_{01})$, $H_{11}(\hat\phi_{01}, \hat\phi_{01})$, and $H_{12}(\hat\phi_{01}, c)$ is according to the
partition $\phi_0' = (\phi_{01}', \phi_{02}') = (\phi_{01}', c')$.

Notice that $H_{11}(\hat\phi_{01}, \hat\phi_{01})$ also plays a role in the Newton-Raphson procedure for solving the estimation
equations and in computation of the observed information matrix. So its inverse will usually be available at
the end of the estimation procedure. Further, if the validity of the model of the null-hypothesis is tested
against various alternative models, the computational task is relieved because the inverse of $H_{11}(\hat\phi_{01}, \hat\phi_{01})$ is
already available and the order of $W$ is equal to the number of parameters fixed, which must be small to
keep the interpretation of the outcome tractable.

The interpretation of the outcome of the test is supported by observing that the value of Equation 15
depends on the magnitude of $h(c)$, that is, on the first order derivatives with respect to the parameters $\phi_{02}$
evaluated in $c$. If the absolute values of these derivatives are large, the fixed parameters are bound to change
once they are set free, and the test is significant, that is, the special model is rejected. If the absolute values of
these derivatives are small, the fixed parameters will probably show little change should they be set free,
that is, the values at which these parameters are fixed in the special model are adequate and the test is not
significant; therefore, the special model is not rejected.
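The computation in Equations 15 and 16 can be sketched for the simplest case of a single fixed parameter, where W is a scalar. This is a sketch under stated assumptions, not the implementation used in the study: the matrices are small hypothetical numbers, the linear solver is a plain Gaussian elimination, and the information matrix is assumed symmetric so that H12 is the transpose of H21.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def lm_statistic_single(h_c, H22, H21, H11):
    """LM statistic for one fixed parameter.

    h_c: first order derivative with respect to the fixed parameter,
    H22: scalar block, H21: list (one row), H11: matrix of the free parameters.
    """
    # W = H22 - H21 H11^{-1} H12, with H12 taken as the transpose of H21.
    w = H22 - sum(a * b for a, b in zip(H21, solve(H11, H21)))
    return h_c * h_c / w

# Hypothetical values: two free parameters, one fixed parameter.
H11 = [[4.0, 1.0], [1.0, 3.0]]
H21 = [1.0, 2.0]
lm = lm_statistic_single(h_c=2.0, H22=5.0, H21=H21, H11=H11)
significant = lm > 2.706  # 10% critical value of chi-square with 1 degree of freedom
```

Only the small matrix W has to be inverted for each alternative that is tested; the solve against H11 reuses quantities that are available once the null-model has been estimated.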

Lagrange Multiplier Statistics for Quality Control 

In the introduction section, it was noted that simultaneous ML estimates of all parameters in the
3-PL model are hard to obtain (see, for instance, Swaminathan & Gifford, 1986). Therefore, in the present
paper it will be assumed that the guessing parameter $\gamma_i$ is fixed to some plausible constant, say, to the guessing
probability. In this section, it will be shown how an LM statistic can be used for testing whether this fixed
guessing parameter is and remains appropriate when confronted with the data.

Consider $G$ groups labeled $g = 1, \ldots, G$, and let $y_{ng} = 1$ if person $n$ belongs to group $g$ and $y_{ng} = 0$ otherwise.
Suppose the first group partakes in the pretesting stage, and the following groups partake in the online stage.
Given this partition several hypotheses can be tested. For instance, Glas (1998) suggests evaluating DIF by
testing the hypothesis that the item parameters are constant over groups, that is, $\alpha_{ig} = \alpha_i$
and $\beta_{ig} = \beta_i$ for $g = 1, \ldots, G$. This can, of course, also be applied in an adaptive testing situation
for monitoring parameter drift. However, in the present paper, a test for the hypothesis that $\gamma_{ig} = \gamma_i$,
for $g = 1, \ldots, G$, will be given as an example of applying the LM approach to quality control of adaptive
testing. The LM statistic for testing this hypothesis is based on the first order derivatives with respect to $\gamma_{ig}$.
For using Equation 3, the first order derivatives of Equation 4 with respect to $\gamma_{ig}$, $b_n(\gamma_{ig})$, need to be
computed. It is easily verified that

$$b_n(\gamma_{ig}) = y_{ng}\, d_{ni}\, \frac{(x_{ni} - \phi_{ni})(1 - \psi_{ni})}{\phi_{ni}(1 - \phi_{ni})}. \tag{17}$$

Let $\Gamma_i$ be a vector of the elements $\gamma_{ig}$, $g = 1, \ldots, G$. A test for the null-hypothesis $\gamma_{ig} = \gamma_i$ can be based on

$$LM(\Gamma_i) = h(\Gamma_i)'\, W^{-1}\, h(\Gamma_i) \tag{18}$$

with

$$W = H_{22}(\Gamma_i, \Gamma_i) - H_{21}(\Gamma_i, \xi)\, H_{11}(\xi, \xi)^{-1}\, H_{12}(\xi, \Gamma_i), \tag{19}$$

where $\xi$ is the vector of the parameters of the null-model. Therefore, $H_{11}(\xi, \xi)$ is the matrix of second order
derivatives with respect to these parameters, that is, it is equivalent to the matrix defined by Equation 10. If
$h(\Gamma_i)$ and $W$ are evaluated using MML estimates of the null-model, that is, the estimates of $\xi$, the $LM(\Gamma_i)$
statistic has an asymptotic $\chi^2$-distribution with $G$ degrees of freedom.

A Wald Test and a CUSUM Chart for Quality Control 

The CUSUM chart is an instrument of statistical quality control used for detecting small changes in
product features during the production process. The CUSUM chart is used in a [text missing in source]
two-sided hypothesis can be evaluated with the Wald statistic

$$Q_i = d_{ig}'\, W_i^{-1}\, d_{ig}, \tag{20}$$

where $W_i$ is the covariance matrix [text missing in source] item parameters [text missing in source]
Equation 13, computed with the MML estimates obtained in group $g$ and group 1, respectively. The statistic
defined in Equation 20 has an asymptotic $\chi^2$ distribution
with two degrees of freedom. However, the interest is in a one-sided test, so also the signs of the elements of
$d_{ig}$ are needed. Since Equation 20 is a quadratic form, its signed square root is of interest. Further, it may be
interesting to test the hypothesis iteratively. Therefore, a one-sided cumulative sum chart will be based on
the quantity

$$S_i(g) = \max\left\{ S_i(g-1) + \frac{\alpha_{i1} - \alpha_{ig}}{Se(\alpha_{ig} - \alpha_{i1})} + \frac{\beta_{i1} - \beta_{ig}}{Se(\beta_{ig} - \beta_{i1} \mid \alpha_{ig} - \alpha_{i1})} - k,\; 0 \right\}, \tag{21}$$

where $Se(\alpha_{ig} - \alpha_{i1}) = \sigma_\alpha$ and $Se(\beta_{ig} - \beta_{i1} \mid \alpha_{ig} - \alpha_{i1}) = \sqrt{\sigma_\beta^2 - \sigma_{\alpha\beta}^2 / \sigma_\alpha^2}$, with $\sigma_\alpha$, $\sigma_\beta$, and $\sigma_{\alpha\beta}$
the appropriate elements of the covariance matrix $W_i$, which is also used in Equation 20. Further, $k$ is a
reference value. The CUSUM chart starts with

$$S_i(0) = 0, \tag{22}$$

and the null-hypothesis is rejected as soon as

$$S_i(g) > h, \tag{23}$$

where $h$ is some constant threshold value. The choice of the constants $k$ and $h$ determines the power of the
procedure. In the case of the Rasch model, where the null-hypothesis is $\beta_{ig} - \beta_{i1} \geq 0$, and the term involving
the discrimination indices is lacking from Equation 21, Veerkamp (1996) successfully uses $k = 1/2$ and $h = 5$.
This choice was motivated by the consideration that this setup has good power against the alternative
hypothesis of a normalized shift in item difficulty of approximately one standard deviation. In the present
case one extra normalized decision variable is employed, i.e., the variable involving the discrimination indices.
To have power against a shift of one standard deviation of both normalized decision variables in the direction
of the alternative hypothesis, a value $k = 1$ will be tried out below. The value $h = 5$ will not be changed.
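The Wald statistic and the one-sided CUSUM recursion described above can be sketched as follows. Several points are assumptions made for the illustration: the vector in the Wald statistic is taken to hold the differences in the discrimination and difficulty estimates between a later group and group 1, the conditional standard error is taken to be the usual bivariate normal expression, and the recursion is taken to have the standard form in which the reference value k is subtracted and the sum is truncated at zero. All names and numbers are hypothetical.

```python
import math

def wald_statistic(d, W):
    """Quadratic form d' W^{-1} d for a 2-vector d and a 2x2 covariance matrix W."""
    det = W[0][0] * W[1][1] - W[0][1] * W[1][0]
    inv = [[W[1][1] / det, -W[0][1] / det], [-W[1][0] / det, W[0][0] / det]]
    return sum(d[r] * inv[r][c] * d[c] for r in range(2) for c in range(2))

def cusum_chart(alpha_1, beta_1, batches, k=1.0, h=5.0):
    """One-sided CUSUM over successive online batches.

    batches: list of (alpha_g, beta_g, var_alpha, var_beta, cov_alpha_beta),
    where the variances and covariance refer to the differences with group 1.
    Returns the path of the chart and the first batch at which it exceeds h.
    """
    s, path, signal = 0.0, [], None
    for index, (alpha_g, beta_g, var_a, var_b, cov_ab) in enumerate(batches, start=1):
        se_alpha = math.sqrt(var_a)
        se_beta_given_alpha = math.sqrt(var_b - cov_ab ** 2 / var_a)
        # Positive increments point to a loss of discrimination and an easier item.
        increment = ((alpha_1 - alpha_g) / se_alpha
                     + (beta_1 - beta_g) / se_beta_given_alpha)
        s = max(s + increment - k, 0.0)
        path.append(s)
        if signal is None and s > h:
            signal = index
    return path, signal

# Hypothetical item whose estimates shift by the same amount in every batch.
batches = [(0.8, -0.3, 0.04, 0.04, 0.0)] * 4
path, signal = cusum_chart(alpha_1=1.0, beta_1=0.0, batches=batches)
q = wald_statistic([0.2, 0.3], [[0.04, 0.0], [0.0, 0.04]])
```

In this example each batch adds 1.5 to the chart, so the threshold of 5 is crossed at the fourth batch, whereas the two-sided Wald statistic for a single batch (3.25) stays below the 10% critical value of 4.605 for two degrees of freedom. This illustrates why accumulating evidence over batches can pick up shifts that a single comparison misses.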



Examples 

In this section, the power of the procedures suggested above will be investigated using a number of
simulation studies. Since all statistics involve an estimate of the standard error of the parameter estimates,
and this standard error is approximated using Equation 13, the precision of this approximation will be
studied first by assessing the power of the statistics under the null-model, that is, the rate at which a true
null-hypothesis is rejected. Then the power of the tests will be studied under various model violations.

For all simulations reported below, the ability parameters $\theta$ were drawn from a standard normal
distribution. The item difficulties $\beta_i$ were uniformly distributed on [-1.0, 1.0], the discrimination indices $\alpha_i$
were drawn from a log-normal distribution with a zero mean and a standard deviation equal to 0.10, and the
guessing parameter $\gamma_i$ was generally fixed at 0.20. In the online phase, item selection was done using the
maximum information principle. The ability parameter $\theta$ was estimated by its expected a-posteriori value
(EAP); the initial prior was standard normal.
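The two ingredients of the online phase, EAP estimation of ability and maximum information item selection, can be sketched as follows. This is an illustration under assumptions rather than the simulation code of the study: the posterior mean is computed on a simple equally spaced grid with a standard normal prior, the item information is the standard Fisher information of a 3-PL item, and all names and the small item bank are hypothetical.

```python
import math

def three_pl(theta, alpha, beta, gamma):
    psi = 1.0 / (1.0 + math.exp(-(alpha * theta - beta)))
    return gamma + (1.0 - gamma) * psi

def item_information(theta, alpha, beta, gamma):
    """Fisher information of a 3-PL item: (dP/dtheta)^2 / (P (1 - P))."""
    psi = 1.0 / (1.0 + math.exp(-(alpha * theta - beta)))
    p = gamma + (1.0 - gamma) * psi
    slope = (1.0 - gamma) * alpha * psi * (1.0 - psi)
    return slope ** 2 / (p * (1.0 - p))

def eap_estimate(responses, items, n_points=81, width=4.0):
    """Posterior mean of theta under a standard normal prior.

    responses: list of 0/1 scores; items: list of (alpha, beta, gamma) tuples.
    """
    numerator = denominator = 0.0
    for k in range(n_points):
        theta = -width + 2.0 * width * k / (n_points - 1)
        weight = math.exp(-0.5 * theta * theta)  # unnormalized prior density
        for x, (a, b, g) in zip(responses, items):
            p = three_pl(theta, a, b, g)
            weight *= p if x == 1 else 1.0 - p
        numerator += theta * weight
        denominator += weight
    return numerator / denominator

def select_item(theta_hat, bank, administered):
    """Index of the not yet administered item with maximum information at theta_hat."""
    candidates = [i for i in range(len(bank)) if i not in administered]
    return max(candidates, key=lambda i: item_information(theta_hat, *bank[i]))

# Hypothetical bank of three items that differ only in difficulty.
bank = [(1.0, -1.0, 0.2), (1.0, 0.0, 0.2), (1.0, 1.0, 0.2)]
first = select_item(0.0, bank, administered=set())
theta_after = eap_estimate([1], [bank[first]])
second = select_item(theta_after, bank, administered={first})
```

Before any response is observed the EAP equals the prior mean of zero, so the first item chosen is the one that is most informative at average ability.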

The results of eight simulation studies with respect to the power of the statistics under the null-model
are shown in Table 1. The number of items K in the item bank was fixed at 50 for the first four studies and at
100 for the next four studies. Both in the pretest phase and the online phase, test lengths L of 20 and 40 were
chosen; the exact setup is shown in the first two columns of Table 1. Finally, in the third column it can be
seen that the number of respondents per phase was fixed at 500 and 1,000 respondents. So summed over the
pretest and online phase, the sample sizes were 1,000 and 2,000 respondents, respectively. For the pretest
phase, a spiraled test administration design was used. For instance, for the K = 50 studies, for the pretest
phase, five subgroups were used: the first subgroup was administered items 1 to 20, the second, items 11 to
30, the third, items 21 to 40, the fourth, items 31 to 50, and the fifth group received items 1 to 10 and 41 to 50.
In this manner, all items drew the same number of responses in the pretest phase. For the K = 100 studies, the
pretest phase consisted of four subgroups administered 50 items. Here the design was 1-50, 26-75, 51-100
and 1-25 and 76-100. One hundred replications were run for each study.
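The spiraled pretest design described above can be generated with overlapping blocks that wrap around the end of the item bank. The function below is a hypothetical helper written for this illustration; it reproduces the two designs listed in the text and shows that every item is administered to the same number of subgroups.

```python
def spiraled_design(n_items, block_length, step):
    """Return the 1-based item numbers per subgroup for overlapping blocks with wraparound."""
    subgroups = []
    for start in range(0, n_items, step):
        block = [(start + offset) % n_items + 1 for offset in range(block_length)]
        subgroups.append(sorted(block))
    return subgroups

design_50 = spiraled_design(50, 20, 10)    # five subgroups of 20 items
design_100 = spiraled_design(100, 50, 25)  # four subgroups of 50 items

# Count how many subgroups receive each item.
counts_50 = [sum(item in block for block in design_50) for item in range(1, 51)]
counts_100 = [sum(item in block for block in design_100) for item in range(1, 101)]
```

With a block length of twice the step, each item appears in exactly two subgroups, which is what balances the number of pretest responses per item.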
TABLE 1

Power of LM and Wald test under the null-model (100 replications)

                          Percentage Significant at 10%
  K     L      N_g     LM Test     Wald Test
 50    20      500           8             9
             1,000          10            10
       40      500           9            10
             1,000          11             8
100    20      500          12            10
             1,000           8             9
       40      500          10            12
             1,000          10            10

Note. K = size of the item pool; L = test length; N_g = number of persons
in calibration and adaptive testing batches.



The results of the study are shown in the last two columns of Table 1. These columns contain the
percentages of LM and Wald tests that were significant at the 10% level. It can be seen that the power of the
tests is close to its theoretical value of 10%. Therefore, it can be concluded that the approximations of the
standard errors are sufficiently accurate.

The next simulations focused on the power in the case that the online responses were given
using a value for the guessing parameter $\gamma_i$ that was different from the value of the pretest phase. Results are
shown in Table 2. The first panel of the table pertains to a situation where, for the items 5, 10, 15, etc., $\gamma_i$
changes from 0.00 in the pretest phase to 0.25 in the online phase. So 20% of the items do not fit the
null-model of the pretest phase. In the fourth and fifth column, the rejection rate of aberrant items using a
10% significance level is given for the LM and Wald test, respectively. The number of replications was 100.
It can be seen that the power of both tests is quite large. Then, for 20 replications, 9 more batches of size $N_g$ of
respondents were generated and for each new batch, the CUSUM statistic defined by Equation 21 was
computed. In the last six columns the percentage of the detected aberrant items is shown. Non-aberrant
items were detected at chance level, in this case 5%. It can be seen that approximately 100% of the aberrant
items are detected after 4 iterations, which can be considered quite good.
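As a rough check on the null-model percentages in Table 1, the scatter that sampling error alone would produce around the nominal 10% can be computed. The sketch treats each reported percentage as if it were based on 100 independent tests, which is a simplifying and probably conservative assumption, since each replication involves many items. The function name is hypothetical; the percentages are those of Table 1.

```python
import math

def binomial_standard_error(rate, replications):
    """Standard error of an observed proportion under a binomial model."""
    return math.sqrt(rate * (1.0 - rate) / replications)

se = binomial_standard_error(0.10, 100)
band = (0.10 - 2.0 * se, 0.10 + 2.0 * se)  # roughly two standard errors around 10%

# LM and Wald percentages from Table 1, row by row.
table_1_rates = [8, 9, 10, 10, 9, 10, 11, 8, 12, 10, 8, 9, 10, 12, 10, 10]
inside = all(band[0] <= r / 100.0 <= band[1] for r in table_1_rates)
```

Under this assumption the standard error is 3 percentage points, so observed rates between 8% and 12% are well within what chance fluctuation would allow.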
TABLE 2

Detection of aberrant items: changes in $\gamma_i$
(per row: 100 replications for LM/Wald and 20 replications for CUSUM)

                  Significant at 10%      CUSUM Detected After Iteration
  K    L     N_g   LM Test  Wald Test       2     3     4     6     8    10

From $\gamma_i$ = .00 to $\gamma_i$ = .25
 50   20     500        95         69      72    77    88   100   100   100
           1,000       100         70      85    90   100   100   100   100
      40     500       100        100      77    83    99   100   100   100
           1,000       100        100      93    98   100   100   100   100
100   20     500        92         93      69    75    92   100   100   100
           1,000        98         92      81    95   100   100   100   100
      40     500       100        100      73    87   100   100   100   100
           1,000       100        100      88    99   100   100   100   100

From $\gamma_i$ = .20 to $\gamma_i$ = .30
 50   20     500        10         25       2     3     4    12    33    45
           1,000        40         60       2     4     4    35    58    66
      40     500        31         22       3     3     4    22    44    65
           1,000        55         73       4     6     7    45    56    78
100   20     500        18         21       1     2    10    11    45    50
           1,000        58         47       4     5     5    13    54    67
      40     500        42         44       3     4     7    32    45    75
           1,000        49         77       2     6     7    22    50    76

From $\gamma_i$ = .20 to $\gamma_i$ = .40
 50   20     500        50         44      10    15    19    40    66    70
           1,000        90         60      12    18    22    50    81    82
      40     500        89         97      18    26    33    76    89   100
           1,000       100         99      17    24    38    73   100   100
100   20     500        52         44       9    12    18    34    75    86
           1,000        88         73      11    22    25    68    79   100
      40     500        90         82      19    24    31    57    83   100
           1,000       100        100      18    29    30    77   100   100


The positive picture of the power of the LM, Wald, and CUSUM changes dramatically if $\gamma_i$ changes
from 0.20 in the pretest phase to 0.30 in the online phase. From the second panel of Table 2, it can be seen
that in this case the power of the LM and Wald test is quite low, while even after 10 iterations the CUSUM
procedure has only detected roughly half to three quarters (45% to 78%) of the aberrant items. In the last
panel of Table 2, $\gamma_i$ changes from 0.20 to 0.40 and the power becomes better, although for the L = 20 studies,
the power is still quite low.

Note that in the above simulations, only the LM test is strictly aimed at the alternative that $\gamma_i$ has
changed. However, the estimates of the three parameters of the 3-PL model are highly correlated. This
implies that changes in parameters are often confounded and it is very difficult to identify the
parameter that is changing. For instance, if an item becomes known, this can be translated into an
augmentation of $\gamma_i$, that is, into an augmentation of the probability of a correct response unassociated with $\theta$,
into a loss of discriminating power, and into a lowering of item difficulty. As a consequence, a test that should
be sensitive to changes in $\gamma_i$ may also have power against changes in $\alpha_i$ and $\beta_i$. The latter case was investigated
using the same simulation setup as above. The results are displayed in Table 3; the first panel pertains to a
change of -0.50 in the difficulty of the items 5, 10, 15, 20, and so on; the second panel pertains to a change of
-1.00 in the difficulty of these items. It can be seen that all tests are indeed sensitive to these changes; the
power for the change of -1.00 is especially high.
TABLE 3

Detection of aberrant items: changes in $\beta_i$
(per row: 100 replications for LM/Wald and 20 for CUSUM)

                  Significant at 10%      CUSUM Detected After Iteration
  K    L     N_g   LM Test  Wald Test       2     3     4     6     8    10

Change -0.50
 50   20     500        25         24      12    17    33    65    96   100
           1,000        27         28      10    17    28    80    91   100
      40     500        24         22      12    26    38    88    99   100
           1,000        30         20      15    22    41    83   100   100
100   20     500        23         21      12    18    31    94    95   100
           1,000        44         33       5    20    32    78    89   100
      40     500        50         42      19    24    54    87   100   100
           1,000        53         44      17    23    55    87   100   100

Change -1.00
 50   20     500        99         89      80    99   100   100   100   100
           1,000        90         90      85    90   100   100   100   100
      40     500        89         96      87    83   100   100   100   100
           1,000        94         96      89    98   100   100   100   100
100   20     500        96         98      87    95    98   100   100   100
           1,000        99         92      83    95   100   100   100   100
      40     500        89         94      93    97   100   100   100   100
           1,000        99         99      98    99   100   100   100   100


Discussion 



This paper explored how to evaluate whether the adaptive testing data used for online calibration
sufficiently fit the item response model used. Three approaches were studied, one based on a
Lagrange multiplier (LM) statistic, the others on a Wald and a cumulative sum (CUSUM) statistic, respectively.
A practical advantage of the latter procedure is that it is based on a directional hypothesis and can be used
iteratively. The power of the tests was evaluated with a number of simulation studies. It was found that the
power of the procedures ranged from rather moderate for a change from $\gamma_i$ = 0.20 to $\gamma_i$ = 0.30, to good for a
change from $\gamma_i$ = 0.00 to $\gamma_i$ = 0.25. Further, it was found that the tests are equally sensitive to changes in item
difficulty and the guessing parameter. So the bottom line here is that all these statistics detect that something
has happened to the parameters, but it will be very difficult to attribute misfit to specific parameters.